Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

tenpercent · 2026-01-16T03:17:37Z

Summary

Use compiler builtin __make_integer_seq for sequence_gen and uniform_sequence_gen
Optimize sequence_merge using direct concatenation for small cases
Reduces template instantiation depth from O(N) to O(1)

Motivation

The recursive sequence_gen_impl creates deep template chains for large sequences, increasing compile time and memory usage.

Why It Works

The compiler builtin __make_integer_seq generates the entire integer sequence in a single step internally, without recursive template instantiation. This reduces instantiation depth from O(N) to O(1), significantly cutting compile time for large sequences.

Additionally, sequence_merge now uses direct concatenation for 2-3 sequences instead of recursive merge, reducing overhead.

Build Performance Impact

Template Instantiation Reduction (measured on device_grouped_conv3d_fwd_bias_bnorm_clamp_instance target, 248 files):

Baseline: 7,748,880 total instantiations
PR Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585: 7,647,268 total instantiations
Improvement: -101,612 instantiations (1.3% reduction)

This confirms the optimization successfully reduces template instantiation overhead by using compiler builtins.

sequence_merge_impl Instantiation Analysis:

Analyzed 110,795 sequence_merge_impl instantiations across the build target:

Sequences	Instantiations	Percentage	Status
1	3,936	3.6%	Base case ✓
2	61,868	55.8%	Base case ✓
3	13,090	11.8%	Base case ✓
4	2,102	1.9%	Base case ✓
5	9,925	9.0%	Binary tree (opportunity)
6	2,806	2.5%	Binary tree
7+	13,068	11.8%	Binary tree

Current coverage: 73.1% handled by base cases (1-4 sequences)
Opportunity: Adding a 5-sequence base case would increase coverage to 82.1% (+9.0%)

The existing base cases handle the most common patterns efficiently. The 5-sequence case represents the largest remaining optimization opportunity.

Test Plan

Added 4 unit tests for sequence_gen with custom functors
Waiting for full CI

PR Stack

This PR is part of the build time optimization effort (issue #3575). All PRs now target develop independently:

#	PR	Description	Status
1	#3585	sequence_gen with `__make_integer_seq`	This PR
2	#3628	generate_identity_sequences + named functors	New (replaces #3588, #3589)
3	#3590	container_concat optimization	Updated
4	#3596	O(1) pack expansion rewrites	Updated
5	#3600	TensorDescriptor/TensorAdaptor lambda elimination	Updated

Tracking issue: #3575

shumway · 2026-01-16T17:08:59Z

Do you want to add unit tests for this, or just rely on the tests of all the code that depends on this? If it's easy to add unit tests, that's generally better, but I'm also fine with moving fast to cut down compilation times.

shumway · 2026-01-16T17:14:51Z

include/ck/utility/sequence.hpp

-// generate sequence
-template <index_t NSize, typename F>
-struct sequence_gen
+// Four sequences: direct concatenation


I like these specializations. It will be interesting to get a survey of the code to see how often the specializations are used and if these four smallest cases are the most impactful ones.

I'm using the build traces to drive the optimizations. Maybe removing the unused code is one other aspect which could help with parsing times

tenpercent · 2026-01-16T19:17:39Z

Do you want to add unit tests for this, or just rely on the tests of all the code that depends on this? If it's easy to add unit tests, that's generally better, but I'm also fine with moving fast to cut down compilation times.

let's move fast, in case something important breaks CI will catch it, tests also need maintenance and let's see how much of the metaprogramming is left after the initial sprint

…tiation depth This change significantly improves compile-time performance by reducing template instantiation depth for sequence generation and merging operations: Optimizations: - sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using __make_integer_seq to generate indices in a single step, then applying the functor via pack expansion - uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq with a helper that applies a constant value via pack expansion - sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction strategy. Added direct concatenation specializations for 1-4 sequences to avoid recursion in common cases, falling back to binary tree merging for 5+ sequences Documentation: - Added extensive inline comments explaining why sequence_merge cannot achieve O(1) depth like sequence_gen (requires computing cumulative sequence lengths from heterogeneous inputs, inherently requiring recursion) - Documented the binary tree reduction approach and why it's superior to fold expressions for this use case Testing: - Added comprehensive unit tests for uniform_sequence_gen with different values, sizes, and edge cases - Added tests for sequence_gen with custom functors (double, square, identity, constant) to verify the new implementation works with arbitrary functors - Added tests for sequence_merge with 4, 5, and many sequences to verify both the direct concatenation path and binary tree reduction path - Added tests for empty sequence edge cases

shumway

This looks great. I may be misreading, but it looks like we're sill O(N) for large number of sequences, but that's not important.

My reading is that we keep merging the first pair of seqeunces, not really a binary reduction. But we could merge the first triple of sequences easily with a one-line change.

shumway · 2026-01-23T23:37:19Z

include/ck/utility/sequence.hpp

+struct sequence_merge_impl<S1, S2, S3, S4, Rest...>
+{
+    // Merge pairs first, then recurse
+    using left  = typename sequence_merge_impl<S1, S2>::type;


I may be misreading this, but since we have three hard-coded, is faster to use <S1, S2, S3> for left and <S4, Rest...> for right.

Also, my reading is that this is still O(N) when N is a large number of sequences, but we're cutting down the prefactor and hardcoding smaller numbers of sequences.

yes, good catch! covering ~70% of the instantiations with the base cases is probably the actual impact. we can follow up with it if it shows up in the build traces again

tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from a477221 to 57c8cb1 Compare January 16, 2026 03:34

shumway reviewed Jan 16, 2026

View reviewed changes

tenpercent mentioned this pull request Jan 16, 2026

Replace O(N) recursive sequence_map_inverse with O(1) pack expansion #3596

Open

1 task

tenpercent mentioned this pull request Jan 16, 2026

Replace nested static_for lambdas with compile-time search helper #3600

Draft

2 tasks

tenpercent marked this pull request as ready for review January 17, 2026 03:41

tenpercent requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, vidyasagar-amd and vpietila-amd as code owners January 17, 2026 03:41

tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from 57c8cb1 to 3d46680 Compare January 17, 2026 03:51

tenpercent mentioned this pull request Jan 19, 2026

Add unit tests for template optimization helpers #3610

Closed

tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from 02dd8d6 to 3dcda67 Compare January 22, 2026 17:06

tenpercent force-pushed the tenpercent/old-ck-pack-rewrites branch from 3dcda67 to 8826238 Compare January 22, 2026 17:26

tenpercent marked this pull request as draft January 22, 2026 18:48

tenpercent marked this pull request as ready for review January 22, 2026 20:11

tenpercent marked this pull request as draft January 23, 2026 00:19

tenpercent marked this pull request as ready for review January 23, 2026 02:30

tenpercent marked this pull request as draft January 23, 2026 02:42

tenpercent marked this pull request as ready for review January 23, 2026 04:25

illsilin assigned shumway and cgmillette Jan 23, 2026

tenpercent requested review from Copilot and removed request for Copilot January 23, 2026 22:38

shumway approved these changes Jan 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

Uh oh!

tenpercent commented Jan 16, 2026 •

edited

Loading

Uh oh!

shumway commented Jan 16, 2026

Uh oh!

shumway Jan 16, 2026

Uh oh!

tenpercent Jan 16, 2026

Uh oh!

tenpercent commented Jan 16, 2026

Uh oh!

shumway left a comment

Uh oh!

shumway Jan 23, 2026

Uh oh!

tenpercent Jan 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

Are you sure you want to change the base?

Optimize sequence_gen and uniform_sequence_gen to reduce template instantiation depth #3585

Uh oh!

Conversation

tenpercent commented Jan 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Why It Works

Build Performance Impact

Test Plan

PR Stack

Uh oh!

shumway commented Jan 16, 2026

Uh oh!

shumway Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

tenpercent Jan 16, 2026

Choose a reason for hiding this comment

Uh oh!

tenpercent commented Jan 16, 2026

Uh oh!

shumway left a comment

Choose a reason for hiding this comment

Uh oh!

shumway Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

tenpercent Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tenpercent commented Jan 16, 2026 •

edited

Loading